AITopics | Saskatchewan

Collaborating Authors

Saskatchewan

DataMan: Data Manager for Pre-training Large Language Models

Peng, Ru, Yang, Kexin, Zeng, Yawen, Lin, Junyang, Liu, Dayiheng, Zhao, Junbo

arXiv.org Artificial IntelligenceMar-13-2025

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by ``reverse thinking'' -- prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.19363

Country:

South America > Brazil > Ceará (0.14)
North America > Canada > Saskatchewan (0.14)
Europe > United Kingdom > England (0.14)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Sports (1.00)
Law (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.92)

Add feedback

Time series forecasting based on optimized LLM for fault prediction in distribution power grid insulators

Matos-Carvalho, João Pedro, Stefenon, Stefano Frizzo, Leithardt, Valderi Reis Quietinho, Yow, Kin-Choong

arXiv.org Artificial IntelligenceFeb-27-2025

Surface contamination on electrical grid insulators leads to an increase in leakage current until an electrical discharge occurs, which can result in a power system shutdown. To mitigate the possibility of disruptive faults resulting in a power outage, monitoring contamination and leakage current can help predict the progression of faults. Given this need, this paper proposes a hybrid deep learning (DL) model for predicting the increase in leakage current in high-voltage insulators. The hybrid structure considers a multi-criteria optimization using tree-structured Parzen estimation, an input stage filter for signal noise attenuation combined with a large language model (LLM) applied for time series forecasting. The proposed optimized LLM outperforms state-of-the-art DL models with a root-mean-square error equal to 2.24$\times10^{-4}$ for a short-term horizon and 1.21$\times10^{-3}$ for a medium-term horizon.

forecasting, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2502.17341

Country:

Asia (0.92)
North America > United States (0.28)
Europe > Switzerland (0.28)
(2 more...)

Genre: Research Report > Promising Solution (0.46)

Industry:

Energy > Power Industry (1.00)
Energy > Renewable > Wind (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Navigating AI to Unpack Youth Privacy Concerns: An In-Depth Exploration and Systematic Review

Shrestha, Ajay Kumar, Barthwal, Ankur, Campbell, Molly, Shouli, Austin, Syed, Saad, Joshi, Sandhya, Vassileva, Julita

arXiv.org Artificial IntelligenceDec-20-2024

This systematic literature review investigates perceptions, concerns, and expectations of young digital citizens regarding privacy in artificial intelligence (AI) systems, focusing on social media platforms, educational technology, gaming systems, and recommendation algorithms. Using a rigorous methodology, the review started with 2,000 papers, narrowed down to 552 after initial screening, and finally refined to 108 for detailed analysis. Data extraction focused on privacy concerns, data-sharing practices, the balance between privacy and utility, trust factors in AI, transparency expectations, and strategies to enhance user control over personal data. Findings reveal significant privacy concerns among young users, including a perceived lack of control over personal information, potential misuse of data by AI, and fears of data breaches and unauthorized access. These issues are worsened by unclear data collection practices and insufficient transparency in AI applications. The intention to share data is closely associated with perceived benefits and data protection assurances. The study also highlights the role of parental mediation and the need for comprehensive education on data privacy. Balancing privacy and utility in AI applications is crucial, as young digital citizens value personalized services but remain wary of privacy risks. Trust in AI is significantly influenced by transparency, reliability, predictable behavior, and clear communication about data usage. Strategies to improve user control over personal data include access to and correction of data, clear consent mechanisms, and robust data protection assurances. The review identifies research gaps and suggests future directions, such as longitudinal studies, multicultural comparisons, and the development of ethical AI frameworks.

artificial intelligence, data mining, social media, (18 more...)

arXiv.org Artificial Intelligence

2412.16369

Country: North America > Canada > Saskatchewan (0.28)

Genre:

Research Report (1.00)
Overview (1.00)
Instructional Material (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education > Educational Setting > K-12 Education (0.68)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
(3 more...)

Add feedback

Grounding Language in Multi-Perspective Referential Communication

Tang, Zineng, Mao, Lingjun, Suhr, Alane

arXiv.org Artificial IntelligenceOct-4-2024

We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another's visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 humanwritten referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment training an open-weight speaker model with evidence of communicative success Figure 1: Example scene from our environment and when paired with a listener, resulting in dataset. The center image shows the speaker on the left an improvement from 58.9 to 69.3% in communicative and the listener on the right with their respective fields success and even outperforming the of view (FOV). The speaker refers to the target object, strongest proprietary model.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2410.03959

Country:

North America > United States > California (0.28)
North America > Canada > Saskatchewan (0.24)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition

Nahid, Md Mahadi Hasan, Rafiei, Davood

arXiv.org Artificial IntelligenceApr-15-2024

Table reasoning is a challenging task that requires understanding both natural language questions and structured tabular data. Large language models (LLMs) have shown impressive capabilities in natural language understanding and generation, but they often struggle with large tables due to their limited input length. In this paper, we propose TabSQLify, a novel method that leverages text-to-SQL generation to decompose tables into smaller and relevant sub-tables, containing only essential information for answering questions or verifying statements, before performing the reasoning task. In our comprehensive evaluation on four challenging datasets, our approach demonstrates comparable or superior performance compared to prevailing methods reliant on full tables as input. Moreover, our method can reduce the input context length significantly, making it more scalable and efficient for large-scale table reasoning applications. Our method performs remarkably well on the WikiTQ benchmark, achieving an accuracy of 64.7%. Additionally, on the TabFact benchmark, it achieves a high accuracy of 79.5%. These results surpass other LLM-based baseline models on gpt-3.5-turbo (chatgpt). TabSQLify can reduce the table size significantly alleviating the computational load on LLMs when handling large tables without compromising performance.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.1015

Country:

Europe (1.00)
Asia (1.00)
North America > Canada > Saskatchewan > Saskatoon (0.14)
(2 more...)

Genre: Research Report > Promising Solution (0.48)

Industry: Leisure & Entertainment > Sports (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Automated User Story Generation with Test Case Specification Using Large Language Model

Rahman, Tajmilur, Zhu, Yuecai

arXiv.org Artificial IntelligenceApr-1-2024

Modern Software Engineering era is moving fast with the assistance of artificial intelligence (AI), especially Large Language Models (LLM). Researchers have already started automating many parts of the software development workflow. Requirements Engineering (RE) is a crucial phase that begins the software development cycle through multiple discussions on a proposed scope of work documented in different forms. RE phase ends with a list of user-stories for each unit task identified through discussions and usually these are created and tracked on a project management tool such as Jira, AzurDev etc. In this research we developed a tool "GeneUS" using GPT-4.0 to automatically create user stories from requirements document which is the outcome of the RE phase. The output is provided in JSON format leaving the possibilities open for downstream integration to the popular project management tools. Analyzing requirements documents takes significant effort and multiple meetings with stakeholders. We believe, automating this process will certainly reduce additional load off the software engineers, and increase the productivity since they will be able to utilize their time on other prioritized tasks.

large language model, natural language, user story, (15 more...)

arXiv.org Artificial Intelligence

2404.01558

Country:

North America > Canada > Saskatchewan (0.14)
North America > Canada > Quebec > Montreal (0.14)

Genre:

Questionnaire & Opinion Survey (0.94)
Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.68)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

MugenNet: A Novel Combined Convolution Neural Network and Transformer Network with its Application for Colonic Polyp Image Segmentation

Peng, Chen, Qian, Zhiqin, Wang, Kunyu, Luo, Qi, Bi, Zhuming, Zhang, Wenjun

arXiv.org Artificial IntelligenceMar-31-2024

Biomedical image segmentation is a very important part in disease diagnosis. The term "colonic polyps" refers to polypoid lesions that occur on the surface of the colonic mucosa within the intestinal lumen. In clinical practice, early detection of polyps is conducted through colonoscopy examinations and biomedical image processing. Therefore, the accurate polyp image segmentation is of great significance in colonoscopy examinations. Convolutional Neural Network (CNN) is a common automatic segmentation method, but its main disadvantage is the long training time. Transformer utilizes a self-attention mechanism, which essentially assigns different importance weights to each piece of information, thus achieving high computational efficiency during segmentation. However, a potential drawback is the risk of information loss. In the study reported in this paper, based on the well-known hybridization principle, we proposed a method to combine CNN and Transformer to retain the strengths of both, and we applied this method to build a system called MugenNet for colonic polyp image segmentation. We conducted a comprehensive experiment to compare MugenNet with other CNN models on five publicly available datasets. The ablation experiment on MugentNet was conducted as well. The experimental results show that MugenNet achieves significantly higher processing speed and accuracy compared with CNN alone. The generalized implication with our work is a method to optimally combine two complimentary methods of machine learning.

artificial intelligence, machine learning, mugennet, (18 more...)

arXiv.org Artificial Intelligence

2404.00726

Country:

Europe (0.93)
Asia > China (0.28)
North America > Canada > Saskatchewan (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (0.87)
Health & Medicine > Therapeutic Area > Gastroenterology (0.87)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Modified CycleGAN for the synthesization of samples for wheat head segmentation

Myers, Jaden, Najafian, Keyhan, Maleki, Farhad, Ovens, Katie

arXiv.org Artificial IntelligenceFeb-23-2024

Deep learning models have been used for a variety of image processing tasks. However, most of these models are developed through supervised learning approaches, which rely heavily on the availability of large-scale annotated datasets. Developing such datasets is tedious and expensive. In the absence of an annotated dataset, synthetic data can be used for model development; however, due to the substantial differences between simulated and real data, a phenomenon referred to as domain gap, the resulting models often underperform when applied to real data. In this research, we aim to address this challenge by first computationally simulating a large-scale annotated dataset and then using a generative adversarial network (GAN) to fill the gap between simulated and real images. This approach results in a synthetic dataset that can be effectively utilized to train a deep-learning model. Using this approach, we developed a realistic annotated synthetic dataset for wheat head segmentation. This dataset was then used to develop a deep-learning model for semantic segmentation. The resulting model achieved a Dice score of 83.4\% on an internal dataset and Dice scores of 79.6% and 83.6% on two external Global Wheat Head Detection datasets. While we proposed this approach in the context of wheat head segmentation, it can be generalized to other crop types or, more broadly, to images with dense, repeated patterns such as those found in cellular imagery.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2402.15135

Country:

North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.30)
North America > Canada > Saskatchewan (0.28)

Genre: Research Report (0.64)

Industry: Food & Agriculture > Agriculture (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding

Wang, Zilong, Zhang, Hao, Li, Chun-Liang, Eisenschlos, Julian Martin, Perot, Vincent, Wang, Zifeng, Miculicich, Lesly, Fujii, Yasuhisa, Shang, Jingbo, Lee, Chen-Yu, Pfister, Tomas

arXiv.org Artificial IntelligenceJan-18-2024

Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Tables are a popular data format and widely used in daily life (Cafarella et al., 2008). Understanding tabular data with language models can benefit various downstream tasks, such as table-based fact verification (Chen et al., 2019), and table-based question answering (Jin et al., 2022). Distinct from pure text, tables deliver rich information through the interaction between rows and columns in the tabular structure, which enhances the data capacity but also increases the difficulty for language models to understand them. Thus, reasoning over the tabular data is an important direction in natural language processing and attracts increasing attention from both academia and industry. In recent years, several approaches have been suggested to tackle the problem of table understanding by training language models. One common direction is to add specialized embedding layers or attention mechanisms into language models and pre-train the models by recovering table cells or segments (Herzig et al., 2020; Wang et al., 2021; Gu et al., 2022; Andrejczuk et al., 2022).

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2401.04398

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Passenger (1.00)
Transportation > Air (0.93)
Consumer Products & Services > Travel (0.93)
Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AI weed-killing drones poised to reduce herbicide use while cutting costs

The Japan TimesApr-20-2023, 05:22:36 GMT

For the past three years, Terry Aberhart has watched the spindly, fixed-wing drones zip across the big skies above his farm in Canada's Saskatchewan province, testing a technology that could be the future of weeding. Fitted with an artificial intelligence system, the drones are designed by local startup Precision AI to spot, identify and kill the weeds without drenching the entire crop in chemicals. "I'm on the list for one of the first machines when they become available," says Aberhart, a sustainable farming enthusiast. "The current technology is designed for maximum coverage and to hit everything in the field." This could be due to a conflict with your ad-blocking or security software.

ai weed-killing drone, artificial intelligence, canada government, (2 more...)

The Japan Times

Country: North America > Canada > Saskatchewan (0.66)

Industry:

Materials > Chemicals > Agricultural Chemicals (0.40)
Food & Agriculture > Agriculture > Pest Control (0.40)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback